Math 419/592 Project Rules and Ideas, Version 1

Winter 2008

Project Rules

Project proposals are required from everyone. A proposal should be at least 1 page long, and will usually include roughly 3 references. The proposal should also specify a journal that the results of this project might be submitted to, plus a secondary (backup) journal. If you need to submit an Inter-Library Loan request, remember to do it early.

Project reports should be in the form of a submission to a specific journal that you think would be the most appropriate. Prof. Ross will let you know if you should actually submit it to the journal; depending on the amount of help you get, a co-authorship may be appropriate. A typical organization for a paper is:

Title, authors, affiliation
Abstract and keywords
Introduction
Literature Review
Model
Methods
Results
Conclusions
Appendix: computer code, other lengthy material

though this may be changed somewhat. Try to emulate (but do not copy!) well-written papers that you have read.

Here is the timeline for the project:

10% Feb.  9: project proposal (>=1 pages)
40% Feb. 23: project submission to Prof. Ross
--% Mar.  6: referee reports returned
40% Mar. 13: project revisions submitted to Prof. Ross
10% Mar. 15: presentations (5 minutes per project)

If you are happy with the grade you receive for the first submission and elect to do nothing, then you need not turn in a revision. In that case, you will receive the same grade for the second 40% as you did for the first 40%. If you miss the first submission deadline, your maximum score on the overall project will be (10+0+40+10)=60; this is, how shall I put this, not recommended. Similarly, turning in an incomplete paper for that first submission will drag down your score, so I recommend to you that you make good use of my office hours to get guidance on your projects.

Also, please make a copy of most of the sources you referenced, so I can read what they have to say.

Project Ideas

This is a preliminary list of project ideas. More ideas are sure to follow. If you have an idea you don't see on this list, please let me know. The lists are divided into mid-term and final lists, mainly because some topics won't be covered until the second half of the semester.

It is important to realize: most of these project ideas will need guidance from Prof. Ross. If you try to do them by just reading these tiny descriptions, things will go very badly. You should plan on meeting with Prof. Ross to discuss possible projects, a week or so before the proposal deadline.

One of the student course evaluations from last year said "Give us more time on the projects." Well, the semester is 15 weeks long or so, broken into halves, and you're free to start the project(s) during week 1. I don't see how I can give you more time. Use the time you have wisely.

Note: DTMC = Discrete-Time Markov Chain, CTMC = Continuous-Time Markov Chain.

Topics that I would particularly like to see done are marked with a star (*).

Mid-Term Project

farecast.com: farecast.com tries to predict how airline fares will change in time: should you buy now, or wait for the price to go down? It's clearly a stochastic process. I haven't thought much more about it. It may be more appropriate for the second half of the semester. There are other, competing sites as well.
Pavement Maintenance as a Markov Decision Process: Each year, you can do minor patching on a piece of road, or major patching, or repave the whole thing. It will degrade according to a random process. Look at this as a Markov Decision Process, prompted by papers by Pablo Durango.
Spam Filtering: Read various articles about spam filtering techniques. Maybe implement a filter for yourself.
2-dimensional Poisson Processes, and others: It's fairly easy to simulate a 2-dimensional Poisson process in Excel, SciLab, etc. It's not so obvious how to simulate other 2-d processes that are more or less bursty. Read up on models that are more or less bursty than Poisson, and figure out how to simulate them in Excel or Scilab, hopefully in an efficient manner.
Hyperexponential Helper and Coxian Companion: Create two spreadsheets or Matlab programs that will help users design Hyperexponential or Coxian distributions that have specified values for their mean, variance, and 3rd moment. See Prof. Ross for ideas on user interface, etc.
Approximations for Ej/Ek overtaking probabilities: What is the probability that someone who arrives after you will leave before you, in a multi-server system? We can calculate it exactly using a DTMC for Erlang arrivals and service durations. But is there a faster way to approximate it? Do some of the detailed calculations using pre-written Matlab code, do graphs, and fit some curves. Problem size: as small as 10x10, possibly up to 1000x1000 or so.
Backgammon Simulation: A simulation for Backgammon that Prof. Ross did some years ago had some odd results. Re-run the simulation and see if the random number generator is causing the problems. Little or no knowledge of Backgammon required.
Stochastic Orderings Illustrated: Read about Stochastic Orderings and various distribution classes (New Better than Used in Expectation, etc.). Find ways to diagram the relationships between classes and orderings. Do some graphs to illustrate various orderings and classes. Prof. Ross can provide some suggestions.
Search Engine Rankings: Google (and probably other search engines) use something like a DTMC to rank pages. Read up on how they do this, and maybe do a miniature version for yourself. This might be related to the "build a better recommender" class that our CompSci department is running right now (Winter 2007).
* Stochastics and Weather Models: What does it mean when a weather forecast says "40 percent chance of rain" or "4-5 inches of snow tonight"? What is a typical correlation between two nearby regions in terms of rainfall? Read meteorology textbooks and journal articles to get a comprehensive picture of how stochastics and weather models interact. Here's an idea to focus your attention: suppose you work for a company whose operations are affected by the weather: a ski lodge, an airline, a school district or university that might shut due to snow, an amusement park, a city public-works (snowplow) department, etc. What would you want to know about weather forecasting to do your job well?
Linear Algebra/DTMC lecture: Prepare and give a 50-minute lecture intended for a linear algebra class (like Math 205) on the wonders of DTMCs. Include: matrix multiplication as transient calculations, linear-systems for steady-state probabilities, eigenvalues to show that the transition matrix is non-invertible, and maybe first-passage probabilities. Use PowerPoint or LaTeX slides.
DTMCs and Inventory: Learn some basic DTMC models for inventory. Either do an in-depth literature review, or a partial literature review plus some computations.
Other literature review: Do a literature review for some field that you're interested in, with at least some overlap with stochastics.
DTMC numerical solvers: There are a number of ways to find the steady-state solution for a DTMC. Program one or two of the moderately advanced ones, such as Gauss-Seidel or SOR. Hopefully in Matlab, but maybe in Excel.
* Quasi-Random Numbers: When we use functions like =RAND() on Excel, it generates what are called pseudo-random numbers. Quasi-random numbers are different: they are deliberately spaced (and correlated) so that they are more uniform. Figure out how to generate quasi-random values in Excel, OpenOffice Calc, Matlab, or Scilab. Other questions to ask: what does a Poisson process generated with quasi-random numbers look like/how does it behave differently? If I use a quasi-random Poisson process as the arrival stream for a queue (possibly with quasi-random service as well), how does the queue behave?
* Geom/Geom/1 queue waiting times: In the simplest discrete-time queueing system, the time that someone spends waiting has a Geometric distribution. The time that the next person spends waiting is also Geometric, and the two are dependent (if one person waits a long time, the person behind them will probably wait a long time too). Set up a DTMC model and solve for its transient probabilities to compute the correlation coefficient, etc. Problem size: maybe a few hundred states? Probably better for Matlab than Excel.
Board games as DTMCs: Monopoly (the board game) has been analyzed as a DTMC. Do the same for some other board game(s). Problem size: Candyland, Snakes-and-Ladders, and Monopoly would each be in the 50x50 to 100x100 range, I think. Also, see the extra comments at the bottom of this document.
DNA or Protein as a DTMC: Note: this project was done last year in a less-than-formal way. If you do it this year, you should expect to do formal statistical tests. Look for some papers that discuss whether DNA sequences or protein sequences follow DTMCs. Take some sequence data (freely available on the internet) and analyze it to see if it's a DTMC. For extreme extra credit, generate DNA or protein sequences by a DTMC, construct the molecules, and see if they cure any major diseases. Another possibility is to get two separate DTMC data sets: one from genes, and the other from regions of junk DNA. Then, take a new sequence of DNA (gene or junk) and compute the likelihood it came from one DTMC or the other. This would require a fair amount of coding. Problem size: 4x4, 16x16, 64x64 at least--maybe also 256x256
Music as a Markov Chain: Again, this suggestion is left over from last year. I highly discourage it, unless you plan on doing a very formal analysis. For example, formal statistical tests to see if the music data does form a DTMC might be acceptable. Problem size: 8x8 (one diatonic octave, one-note history), 13x13 (one chromatic octave, one-note history), approximately 64x64 (one diatonic octave, two-note history), etc.
Babies learn language by DTMC?: Again, this project is discouraged due to the lack of formality. A recent paper proposed that babies find the divisions between spoken words by using a DTMC sort of process. Read that paper and any related papers. Do a small experiment using some set of language data. Problem size: in the range of 100x100 to 1000x1000
Take average or maximum of scores?: In the Olympics, some sports are scored using the average of two performances, while others use the maximum of two performances. How would either scoring system affect your strategy on the first performance, and then on the second performance? Try to come up with a mathematical model that includes the degree to which you "go for it" on each performance, plus some randomness in the results of your own efforts. Similarly, some on-line homework systems allow students to re-take the homework (using newly generated random numbers) over and over again. The professor has the option of using their best score, last score, or average score. What are the implications for student tactics, and what should the professor choose?

Final Project

Analyze the show "1 vs. 100"

The game show "1 vs. 100" has some interesting decisions to be made, about whether you should quit now and go home with your accumulated winnings, or keep going and hope for more winnings. Similar problems are present in Deal or No Deal, and Who Wants to be a Millionaire (how's that for dated!), but 1 vs. 100 is more complicated.

Multi-Armed Bandit Problem

Suppose you are seated at a collection of slot machines. You don't know the probability of payout on each one (they may all be different). You'd like to find out which one is the best, and only play that one, but to do that you have to play all of them a little bit to get an idea of how they pay out. What should you do with your next coin? Should you play it on the best one you've found so far, or should you try a different one to increase your knowledge of its payout probability? There may be an analogy here to how PBS TV stations do fundraising drives. They play a bunch of special shows, then see which ones did well, then play those some more.

Mt/D/c program

Mt means a time-varying Poisson arrival process; D means deterministic service times; "c" means a user-specified number of servers. Do the transient calculations by creating and multiplying transition matrices in a DTMC, using VBA and a spreadsheet interface. Or, use Matlab. Technically this is just a DTMC rather than a CTMC, but it's still okay for the final project.

M/D/c approximation

We have a proposed approximation for the number of servers that an M/D/c system would need in order to provide good service. Evaluate some actual M/D/c systems to see how many servers they actually need, and compare to the approximation. Technically this is just a DTMC rather than a CTMC, but it's still okay for the final project.

Geom/Geom queues

Geom/Geom is the discrete equivalent of an M/M queue. Find and summarize some seminal papers, and implement algorithms in Excel, Matlab, or other package. Numerically compare to M/M queues. Technically this is just a DTMC rather than a CTMC, but it's still okay for the final project. Or, you could do it as a midterm project.

M/E2/c program

Write a program (VBA or Matlab or ???) to evaluate the performance for an M/E2/c queueing system. This is a two-dimensional CTMC. Once it is written, it will be fairly easy to extend it to H2 or C2 service.

Mt/Mt/c/K program

Write some extensions to an existing VBA/spreadsheet program that helps analyze queues whose arrival and service rates change hour-by-hour. For example, one extension is to allow arrival rates to vary according to the number of people in the system (called a machine-repair model). Write a report describing the theory of the program.

Dam Queues

No, I'm not frustrated with queues. I mean, consider the process of water inflow/outflow at a dam as a queue. Find seminal papers, etc. Perhaps consider networks of dams, and also the economics of selling electricity from hydro-power, which can be related to Dynamic Programming as well.

* Counter-Intuitive Behavior in Queues

Usually, increasing the arrival rate, or the mean service duration, makes a queue perform worse. Also, increasing the variance usually makes things worse. However, there are some cases where this is not the case. Do a literature search for such examples, and try to find unifying themes. J.B. Atkinson, 2000 is a good paper. Gross and Harris, page 180, mentions a few results in queueing networks.

* Handicapped Parking Spaces

Research guidelines for the number of handicapped spaces in a parking lot. What is the reasoning behind the guidelines? Does it compare to what Linda Green discovered about the reasoning behind bed occupancy targets for hospitals? How about guidelines for number of handicapped stalls in restrooms?

* Markovian Arrival Process library

Prof. Ross has written a Matlab library that deals with continuous phase-type distributions (which are based on CTMCs). This can be extended to work with Markovian Arrival Processes, which allow some dependence from one inter-arrival to the next.

* Discrete Phase-Type library

Prof. Ross has written a Matlab library that deals with continuous phase-type distributions (which are based on CTMCs). Translate it to work with discrete phase-type distributions (which are based on DTMCs). Also see "Fitting discrete distributions on the first two moments" by Adan, Van Eenige, and Resing.

* CTMCs with negative rates

Sometimes we can bend the rules with CTMCs and give them negative rates. It's unclear when this will work and when it won't. It may also be related to matrix-exponential distributions. Do a literature search, and a few experiments. I can provide some places to start.

* DTMCs with negative probabilities

See if we can translate the above project (CTMCs with negative rates) into DTMCs.

* Stochastic models of sequential lead times

If you order from your supplier today, and then order again tomorrow, your shipments will probably arrive in the same order. That is, the later one won't overtake the earlier one. This is called "sequential lead times", and is a common assumption in some inventory models. Do a literature search on the details of these kinds of models. Also, do a simulation of some sequential lead times under different distributions.

* Airport security waiting times

Download http://waittime.tsa.dhs.gov/waittimes-xml.zip and play around with the data. Do some interesting graphs. Look for unexpected things. See how often the file gets updated.

* Stochastic Models of Alternative Energy

When planning how to meet electric power demand over the next 24 or 48 hours, we often assume that generators will perform as we tell them to. This works pretty well for nuclear, coal, hydro, and gas, but new alternative energy sources (mainly wind and solar) are more random. Do a literature survey on stochastic/statistical models for their generation abilities. Among other things, try to find out:

how much correlation is there from one windmill to the next in a wind farm?
How much correlation is there between a solar farm and a wind farm?
How much correlation is there between either and energy demand? (e.g. hot day = lots of sunshine = lots of air conditioning demand)

Brownian Motion and Inventory

Does anybody ever use Brownian motion or related processes (Ornstein-Uhlenbeck, etc.) to model supply chain inventories? Do a literature survey.

Financial Engineering/Insurance

Look into the basic problem of Insurance: calculating the ruin probability. Find and summarize some seminal papers, and do a numerical investigation of a typical problem.

Financial Engineering/Black-Scholes Formula lecture

Read the textbook section on Black-Scholes, and look up some of the original journal papers. Prepare and deliver a 50-minute lecture on the topic with PowerPoint or LaTeX slides.

Renewal Theory for an industrial problem

Prof. Storer shared a renewal-theory problem with me, about barrels that can get damaged and pulled from service, or expire after 2 years no matter what. Compute the renewal function (not easy!), and draw some graphs. Also see W. Stadje: Note on the renewal process for truncated exponential variables. Statistics and Probability Letters 6 , 61 - 66 (1987).

Renewal Theory: Inspection Paradox website

Prof. Ross has an idea for a web site that will help illustrate the inspection paradox. Help him implement it, and write a report. We might be able to get away with plain HTML, but it may require Java or Flash.

Is it a Renewal Process?

Analyze some physical process (asteroid strikes, earthquakes, radiation, supernova, gamma-ray burst, electric power blackouts, etc.) to see if a Renewal Process (or even Poisson Process) is a good model. This project may have a lower degree of difficulty. To increase the difficulty, look up some references on formal statistical tests for renewal processes; ask to see Prof. Ross's copy of the Bhat textbook for more info.

* Nonlinear Renewal Theory

Read books and articles about Nonlinear Renewal Theory, translate the theoretical theorem statements to something understandable, and come up with a few numerical examples.

* Two-dimensional Renewal Theory

Read books and articles about two-dimensional Renewal Theory, translate the theoretical theorem statements to something understandable, and come up with a few numerical examples. Give examples of real-life situations where it's useful. What is the equivalent of the Central Limit Theorem for Renewal Processes (CLTRP)? What is the equivalent of the Inspection Paradox? Starting references: "Renewal Theory in Two Dimensions: Basic Results" by Hunter, "Renewal Theory in Two Dimensions: Asymptotic Results" by Hunter (Basic Results also has some interesting links to bivariate exponential distributions), and "What is a multi-parameter renewal process?" by Ivanoff, B. Gail and Merzbach, Ely

Financial Engineering: Is it Brownian Motion?

Review articles that claim some physical or financial process follows some kind of Brownian Motion. Find a data set and do your own analysis. Try the Dow Jones average, individual stocks, or even weather data. This project may have a lower degree of difficulty, unless you research formal statistical methods for detecting Brownian motion.

Parking Lot Strategies, single lot

A recent paper analyzed some strategies for finding a good parking space in a single lot. Read it (get a copy from me) and do some extensions.

Parking Lot Strategies, multiple lots

Which parking lot should I try first when I drive to campus? If that one fails, which one should I try afterwards? Things to contemplate: my destination on campus, the random time at which each lot fills up, what time it is now, how long it takes to determine if a lot is full, time to drive from one lot to another, etc. My goal might be to minimize walking distance, or it might be to maximize my probability of making it to class at a certain time. This is something like a Dynamic Program, but maybe not exactly.

* Cell-Phone Traffic Models

In queueing theory, it's important to know how long people stay on the telephone. In cell-phone networks, things are complicated by people moving between cells. Start by reading http://en.wikipedia.org/wiki/Cellular_traffic and maybe do a little simulation.

Disneyland FastPass

Research how FastPass works for roller-coaster lines. Find any academic articles on it. If none exist, do some stochastic modeling of the situation.

Yield Management

Look up some seminal papers on Yield Management, and describe the state of the art. Tackle a small problem yourself. Maybe work with the Lehigh undergrad admissions office?

Financial Engineering: Stochastic Calculus lecture

Prepare and give a 50-minute lecture introducing the main ideas of Stochastic Calculus with PowerPoint or LaTeX slides.

Case-study from DVD

I have two or three DVDs that have hourlong cases. Try to do one of them.

Renewal Theory lecture

Prepare and give a 50-minute lecture on renewal theory. Include the nice intuitive results and the not-so-intuitive results related to the Inspection Paradox. Also include the offset terms for the renewal function (and variance function).

Other lectures

Prepare and give a 50-minute lecture on some other part of the book: CTMCs, queueing theory, reliability theory, etc.

Further descriptions

Music

There are two prominent music file formats. The first is MIDI, but it is kind of difficult to work with. Another is called "ABC", and you can find a lot more about it here . It might be easier to use, as it is all text-based.

Board-game projects

For those of you who are doing board-game projects, a big question is what the goal of your analysis should be. Let's consider a pure chance game with no strategy, like Candyland or Snakes-and-Ladders.

A steady-state DTMC analysis is not possible, because it is a terminating chain. A DTMC analysis would reveal the mean number of turns until finishing from any board position. With a little more work, you can get the whole distribution (and thus the variance). Some questions then are:

What is the probability that the first player wins, if there are 2 players?
If there are more players, how do those probabilities change?
Do a Normal approximation to winning probability using the mean and variance of turns-to-finish from square 1.
Plot the mean turns-till-finish, plus and minus 1 or 2 sigma, for each square on the board.
Plot a representation of your DTMC matrix (for example, using spy() in Matlab)

DNA or Proteins

I don't know if you have a preference for working with human data or with some other organism. My wife is most familiar with the yeast genome database, so that is what I will suggest to start with. The yeast we are talking about is the one that is used in making beer or in making bread. Go to http://db.yeastgenome.org/cgi-bin/SGD/seqTools and type "tif4631" (without the quotes) in the "Enter a Name" box. This is a gene that encodes a protein named EIF4G that is involved in "translation", the process of turning RNA into proteins. Hit "submit", and you get the results page. In the first row of the table, "DNA of Region", click on "NoHeader" to get the AGTC stuff without other formatting information. Also, in the 3rd row of the table, "Protein Translation of ORF", click on "NoHeader" to get the amino acid sequence that this gene codes for. There are 20 amino acids, each represented by a letter like M, T, D, E, etc.

By the way, ORF stands for Open Reading Frame. It basically means some sequence of DNA that starts with a particular 3-letter sequence that means "start reading here".

I think it would be a good idea to focus on data from defined genes, but if you want more data than that, you could always start with Chromosome One, Position One by starting at that same page http://db.yeastgenome.org/cgi-bin/SGD/seqTools and in Box 2, "Pick a Chromosome", pick chromosome 1. Do the same "NoHeader" thing, and you'll get a lot more data to play with. I'm not sure what the * means in the protein data; probably "stop" , but we could treat it just like any other symbol.

If you are more interested in working with human data, try starting at http://www.ncbi.nlm.nih.gov/ but from that point, you know as much as I do.

Babies Learning Language

Start by reading this web page: The Why Files . They don't actually say "Markov chain", but I think as you read it you will get the idea.

One of the prominent people in the field seems to be Prof. Saffran . Perhaps reading some of the papers she lists will be a good place to start. I'm pretty sure Lehigh has on-line access to the journal Science, at least.

Statistical Learning by 8-month-old Infants,
Jenny Saffran et al,
Science, December, 1996, v. 274, pp. 1926-8.

Word Segmentation: The Role of Distributional Clues,
Jenny Saffran et al,
 Journal of Memory and Language, Aug. 1996, pp. 606-621.

Counter-Intuitive Behavior in Queues

Places to look (I have copies of most of these):

Wolff's Stochastic Modeling and the Theory of Queues has a few examples
Cooper's Introduction to Queueing Theory, 3rd ed. pgs 298-301.
"Resource Sharing for Efficiency in Traffic Systems" by D.R. Smith and Ward Whitt, 1981, BSTJ
Davis, Massey, and Whitt 1995
Shortle et al., various parametrizations of Paretos
various Whitt papers?
look through QUESTA abstracts?

Deprecated Project Ideas

The following project ideas come from previous years, and are no longer encouraged as projects. But I didn't want to delete them from the list entirely, so here they are.

Last-In First-Out waiting time distribution: The waiting time distribution for an M/M/1 queue with First-In First-Out (FIFO) service is easy. Last-In First-Out (LIFO) is more complicated, and I haven't seen nice graphs of it. Same for Service-In-Random-Order (SIRO). Either take the existing formulas and find good ways to evaluate them numerically, or do transient CTMC ODEs to find the numbers. Find the variance, and make some nice graphs. Also plot CDFs to show if they are stochastically ordered. Also, plot sample paths under FIFO, LIFO, and SIRO for a set of simulated data.
A CTMC from Rental Truck Networks: Prof. Ross came across a two-dimensional CTMC when pondering how networks of rental truck depots work. He can explain it to you; you try to find a form for the steady-state distribution. Also applicable to networks of modem banks and networks of spare-part supply depots.
A CTMC on Highway Accidents: Read two papers by Prof. Melike Gursoy of Rutgers; program a three- dimensional CTMC (probably in Matlab) and find steady-state solutions; see if they correspond to a guess.
Modeling the spread of rumors: Read the article "Rumours and Errours", Brian Hayes, in American Scientist volume 93 number 3 (May-June 2005) 207-211, and do some experiments or extensions related to it. A review of the article appears in the College Math Journal, Vol 37 number 2, March 2006, page 155-156. Modeling the spread of rumors is very much like modeling the spread of diseases, so this is a good project for bio-oriented people.